Abstract



Given an arbitrary bitstream, we consider the problem of finding the longest substring 
whose ratio of ones to zeroes equals a given value. The central result of this paper is an 
algorithm that solves this problem in linear time. The method involves (i) reformulating 
the problem as a constrained walk through a sparse matrix, and then (ii) developing a data 
structure for this sparse matrix that allows us to perform each step of the walk in amortised 
constant time. We also give a linear time algorithm to find the longest substring whose
ratio of ones to zeroes is bounded below by a given value. Both problems have practical
relevance to cryptography and bioinformatics. 



1 Introduction 


Consider a bitstream of length $n$, that is, a sequence of bits $x_1, x_2, \ldots, x_n$ where each $x_i$ is
0 or 1. We define the density of this bitstream to be the proportion of bits that are equal to one
(equivalently, $\sum_{i=1}^{n} x_i / n$). The density always lies in the range $[0, 1]$: a stream of zeroes has density
0, a stream of ones has density 1, and a stream of random bits should have density close to $1/2$.

In this paper we are interested in the densities of substrings within a bitstream. By a
substring, we mean a continuous sequence of bits $x_a, x_{a+1}, \ldots, x_{b-1}, x_b$, beginning at some arbitrary
position $a$ and ending at some arbitrary position $b$. The length of a substring is the number of
bits that it contains (that is, $b - a + 1$), and the density of the substring is likewise the proportion
of ones that it contains (that is, $\sum_{i=a}^{b} x_i / (b - a + 1)$).

In particular, we are interested in the following two problems:


Problem 1.1 (Fixed density problem). Suppose we are given a bitstream $S$ of length $n$ and a
fixed ratio $\theta \in [0, 1]$. What is the longest substring of $S$ whose density is equal to $\theta$?

Problem 1.2 (Bounded density problem). Suppose we are given a bitstream $S$ of length $n$ and
a fixed ratio $\theta \in [0, 1]$. What is the longest substring of $S$ whose density is at least $\theta$?

For example, suppose we are given the bitstream $S$ = 010110101100 of length $n = 12$. Then
the longest substring with density equal to $\theta = 0.6$ has length ten (0 1011010110 0, where the
spaces delimit the substring), and the longest substring with density at least $\theta = 0.7$ has length
seven (010 1101011 00). Note that each problem might have many solutions or no solution at all.

Both of these problems have important applications for cryptography. Many cryptographic 
systems are dependent on pseudo-random number generators (PRNGs), and any unwanted 
predictability or structure in a PRNG becomes a potential attack point for the underlying 
cryptosystem. For this reason PRNGs are typically subjected to a stringent series of randomness 
tests, such as those described in [13] or [15].

Boztaş et al. have recently designed a new series of randomness tests based on the densities
of substrings [3]. To construct these tests, they use the Erdős-Rényi law of large numbers [1]
to compute the limiting distributions for solutions to the fixed density problem, the bounded
density problem and related problems. They then compare observed values against these limiting
distributions, and they have identified a possible weakness in the Dragon stream cipher [4] as a
result.

Locating substrings with various density properties also has important applications in
bioinformatics. A sequence of DNA consists of a long string of nucleotides marked G, C, T or A,
and subsequences with high proportions of G and C are called GC-rich regions. GC-richness is
correlated with factors such as gene density [18], gene length [5], recombination rates [8], codon
usage [17], and the increasing complexity of organisms [2, 11].

To identify GC-rich regions we convert a DNA sequence into a bitstream, where each G or 
C becomes a one bit, and each T or A becomes a zero bit. We then search for high-density 
substrings in this bitstream, using techniques such as those discussed here. 

Further applications of density problems in the field of bioinformatics are discussed by
Goldwasser et al. [9] and Lin et al. [14]. In addition, Greenberg [10] signals potential applications in
the field of image processing.

The focus of this paper is on finding fast algorithms to solve Problems 1.1 and 1.2. Both
problems allow simple brute-force algorithms that run in $O(n^2)$ time. For the fixed density
problem, Boztaş et al. improve on this with their SkipMisMatch algorithm [3], which remains
$O(n^2)$ in the worst case but has an improved average-case time complexity of $O(n \log n)$. We
outline their contribution in Section 2.

Our first contribution in this paper is a series of simple algorithms that solve both the fixed
and bounded density problems in $O(n \log n)$ time, even in the worst case. These algorithms are
easy to implement and effective in practice, and are based upon a central geometric observation.
We cover these log-linear algorithms in Section 3.

In Section 4 we follow with our main result, which is an algorithm that solves the fixed
density problem in $O(n)$ time, again in the worst case. Based on one of the previous log-linear
algorithms, this algorithm introduces a specialised data structure that allows us to process each
bit of the bitstream in amortised constant time. Broadly speaking, we:

• express our bitstream as a sequence of steps through a sparse matrix, where each step 
requires a localised search and possible insertion into this matrix; 

• design a specialised data structure that "compresses" this sparse matrix, so that each 
localised search and insertion can be performed in amortised constant time. 

The amortised analysis is based on aggregation: in essence we count the "interesting" steps of
the algorithm by associating them with distinct elements of the bitstream, thereby showing the
number of such steps to be $O(n)$. Details of the proof are given in Section 4.3.

Our final contribution is in Section 5, where we give an $O(n)$ time algorithm for the bounded
density problem. In contrast to the fixed density problem, this final algorithm is quite simple,
involving just a handful of linear scans.

To conclude, we measure the practical performance of our algorithms in Section 6. It is
reassuring to find that our linear algorithms are worth the extra difficulty, consistently
outperforming the other algorithms for large bitstream lengths $n$.

In related work, several authors have considered problems of finding maximal density
substrings in a bitstream subject to a variety of constraints. See in particular work by Lin et al.
[14], who place a lower bound on the length of the substring; Goldwasser et al. [9], who improve
the prior solution and also place both lower and upper bounds; and Greenberg [10], who studies
a variant relating to compressed bitstreams.

Hsieh et al. [12] study a series of more general problems, where the bitstream is replaced
by a sequence of real numbers, and the density of a substring becomes the average of the
corresponding subsequence. In addition to developing algorithms, they show that several such
problems (including the fixed density problem) have a lower bound of $\Omega(n \log n)$ time. Our
linear algorithm effectively breaks through this lower bound in the case where the input sequence
consists entirely of zeroes and ones.

Throughout this paper we measure time complexity in "number of operations", where we
treat basic arithmetical operations such as $+$ and $\times$ as constant-time.

2 Quadratic Algorithms: Boztaş et al.

In this section we outline the prior work of Boztaş et al., including a simple $O(n^2)$ brute force
algorithm as well as their SkipMisMatch algorithm, which remains $O(n^2)$ in the worst case but
becomes $O(n \log n)$ in the average case.

Assumption 2.1. Throughout this paper we assume that the ratio $\theta$ is given as a rational
$\theta = \alpha/\beta$, where $\alpha$ and $\beta$ are integers in the range $0 \le \alpha \le \beta \le n$, and where $\gcd(\alpha, \beta) = 1$.

This assumption is not restrictive in any way. If $\theta$ cannot be expressed as above then the
fixed density problem has no solution, and for the bounded density problem we can harmlessly
replace $\theta$ with a nearby rational that satisfies our requirements.

A naive brute force solution runs in $O(n^3)$ time: for each possible start point and end point,
walk through the substring and count the ones. However, there are several different tricks that
can easily convert this into $O(n^2)$ by replacing "walk through the substring" with a constant
time operation. One such trick is to use a rank table.

Definition 2.2 (Rank Table). A rank table is an array $r_0, r_1, \ldots, r_n$, where each entry $r_k$ counts
the number of ones in the substring $x_1, \ldots, x_k$.

In other words, $r_k = \sum_{i=1}^{k} x_i$. It is clear that the complete rank table can be precomputed
in $O(n)$ time, and that it supports constant time queries of the form "how many ones appear
in the substring $x_a, \ldots, x_b$?" by simply computing $r_b - r_{a-1}$.
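As a minimal sketch of the rank table and the quadratic brute force that it enables, the following Python fragment may help. The function names (`rank_table`, `ones_in`, `brute_force_fixed`) are illustrative choices and do not come from the paper, and the density test is done with integer cross-multiplication to avoid floating point.

```python
def rank_table(bits):
    """Return r[0..n], where r[k] is the number of ones among x_1..x_k."""
    r = [0]
    for x in bits:
        r.append(r[-1] + x)
    return r


def ones_in(r, a, b):
    """Number of ones in x_a..x_b (1-indexed, inclusive), in constant time."""
    return r[b] - r[a - 1]


def brute_force_fixed(bits, alpha, beta):
    """Longest (a, b) whose density is exactly alpha/beta, or None. O(n^2) time."""
    n = len(bits)
    r = rank_table(bits)
    best = None
    for a in range(1, n + 1):
        for b in range(a, n + 1):
            # density == alpha/beta  <=>  beta * ones == alpha * length
            if beta * ones_in(r, a, b) == alpha * (b - a + 1):
                if best is None or b - a > best[1] - best[0]:
                    best = (a, b)
    return best


bits = [int(ch) for ch in "010110101100"]
print(rank_table(bits))
print(brute_force_fixed(bits, 3, 5))
```

On the example bitstream from the introduction this reports a substring of length ten for density 3/5; since several solutions may exist, the reported endpoints depend on the search order.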

For the fixed density problem, the SkipMisMatch algorithm further optimises this $O(n^2)$
brute force method by making the following observations:

(i) We are searching for the longest substring of density $\theta$. We can therefore reorganise our
search to work from the longest substring down to the shortest, allowing us to terminate
as soon as we find any substring of density $\theta$.

(ii) If we find such a substring, its length must be a multiple of $\beta$ (where $\theta = \alpha/\beta$ as above).
We can therefore restrict our search to substrings of such lengths.

(iii) When searching for substrings of length $k\beta$, we need to find precisely $k\alpha$ ones to give a
density of $\theta$. If at some point we find $k\alpha \pm \epsilon$ ones, we must step forward at least $\epsilon$ positions
in our bitstream before we can "undo the error" and potentially find the $k\alpha$ ones that we
seek.

Bundling these observations together, we obtain the SkipMisMatch algorithm as illustrated
in Figure 1. The worst-case complexity is clearly still $O(n^2)$, but for a random bitstream the
expected performance can be significantly better. In particular, Boztaş et al. prove the following
result as a part of [3, Lemma 4]:

Lemma 2.3. Suppose we have a random bitstream, where each bit is one with probability $p$ or
zero with probability $1 - p$. Then SkipMisMatch has expected time bounded by
procedure SkipMisMatch(x_1, ..., x_n, θ = α/β)
  Build a rank table r_0, r_1, ..., r_n
  for k ← ⌊n/β⌋ downto 1 do                 ▷ Search for substrings of length kβ
    (a, b) ← (1, kβ)                         ▷ Initial start and end for our substring
    while b ≤ n do
      ε ← |kα - (r_b - r_{a-1})|             ▷ Compute the "error" for this substring
      if ε = 0 then
        Output (a, b) and terminate
      else
        (a, b) ← (a + ε, b + ε)              ▷ We can safely skip forward ε positions
  Output "no such substring" and terminate

Figure 1: The SkipMisMatch algorithm for the fixed density problem



For fixed $\theta$ and $p$, this reduces to an expected time of $O(n \log n)$, as long as $\theta \ne p$. However,
if we retain the dependency on $\theta$ (and hence its denominator $\beta$), we find that SkipMisMatch
is rewarded by large denominators $\beta$ (which enhance the power of optimisation (ii)), and is
penalised by values of $\theta$ close to $p$ (which limit the use of optimisation (iii)).
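The procedure in Figure 1 can be sketched in Python roughly as follows. This is an illustrative transcription rather than the implementation of Boztaş et al.; the function name `skip_mismatch` and the choice to return `None` when no substring exists are assumptions of this sketch.

```python
def skip_mismatch(bits, alpha, beta):
    """Return (a, b), 1-indexed and inclusive, for a longest substring of
    density alpha/beta, or None if no such substring exists."""
    n = len(bits)
    # Rank table: r[k] counts the ones among x_1..x_k.
    r = [0]
    for x in bits:
        r.append(r[-1] + x)

    # Observations (i) and (ii): try lengths k*beta from longest to shortest.
    for k in range(n // beta, 0, -1):
        a, b = 1, k * beta
        while b <= n:
            # Observation (iii): the "error" is how far we are from k*alpha ones.
            err = abs(k * alpha - (r[b] - r[a - 1]))
            if err == 0:
                return (a, b)
            # Each one-position shift changes the count by at most one,
            # so it is safe to skip forward err positions.
            a, b = a + err, b + err
    return None


bits = [int(ch) for ch in "010110101100"]
print(skip_mismatch(bits, 3, 5))
```

Because the search runs from the longest candidate length downwards, the first match found is already a longest one, which is why the function can return immediately.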

To summarise, the SkipMisMatch algorithm is easy to code and runs significantly faster than
brute force, but its performance depends heavily on the given value of $\theta$. In addition, some
broader issues might arise: the expected $O(n \log n)$ time is appropriate for random bitstreams
(as found in cryptographic applications, for instance), but might not hold for applications such as
bioinformatics and image processing where bitstreams become more structured. Moreover, the
algorithm does not translate well to the bounded density problem. All of these reasons highlight
the need for faster and more robust algorithms, which form the subject of the remainder of this
paper.

3 Log-Linear Algorithms: Maps and Sorting 

In this section we introduce our first truly sub-quadratic algorithms for solving the fixed and
bounded density problems. We describe DistMap, a simple algorithm involving a map structure,
and DistSort, a variation that replaces this map with a sort and a linear scan. Both of these
algorithms run in $O(n \log n)$ time, even in the worst case.

Although we present even faster algorithms in Sections 4 and 5, both DistMap and DistSort
are simple to describe and easy to implement. Moreover, both algorithms play important roles:
DistMap is the foundation upon which the linear algorithm of Section 4 is built, and DistSort is
a more flexible variant that can solve both the fixed and bounded density problems.

3.1 Graphical Representations 

Our first step in developing these sub-quadratic algorithms is to find a graphical representation 
for our bitstreams. 

Definition 3.1 (Grid Representation). We can plot any bitstream as a walk through an infinite
two-dimensional grid as follows.¹ We begin at the origin (0, 0), and then step one unit in the
x-direction each time we encounter a zero, or one unit in the y-direction each time we encounter
a one, as illustrated in Figure 2. We refer to this as the grid representation of the bitstream.

All of the new algorithms developed in this paper are based upon the following simple 
geometric observation: 

¹ This is related to, but not the same as, the walk through the sparse matrix that we use for the linear
algorithm in Section 4.

Figure 2: The grid representation for the bitstream 010110101100



Lemma 3.2. A substring of a bitstream has density $\theta$ if and only if the line joining its start
and end points in the grid representation has gradient $\theta/(1 - \theta)$.

To illustrate, Figure 3 builds on the previous example by searching for substrings of density
$\theta = 0.6$. Several pairs of points separated by gradient $\theta/(1 - \theta) = 1.5$ are marked (though there are
several more such pairs that are not marked). The first two pairs correspond to substrings of
length five, and the third pair corresponds to a substring of length ten.




Figure 3: Pairs of points that represent substrings of density $\theta = 0.6$ (each pair is joined by a line of gradient 1.5)

We can find such pairs of points by drawing a line $L_\theta$ through the origin with slope $\theta/(1 - \theta)$,
and then measuring the distance of each point from this line (where distances are signed, so
that points above or below the line have positive or negative distance respectively). This is
illustrated in Figure 4. It is clear that two points are joined by a line of gradient $\theta/(1 - \theta)$ if and
only if their distances from $L_\theta$ are the same.



Figure 4: Measuring the distance of each point from the line $L_\theta$

Although such distances can be messy to compute, with appropriate rescaling we can convert 
them into integers as follows. 

Definition 3.3 (Distance Sequence). Recall from Assumption 2.1 that $\theta = \alpha/\beta$, where $\gcd(\alpha, \beta) = 1$.
For a given bitstream $x_1, \ldots, x_n$, we define the distance sequence $d_0, d_1, \ldots, d_n$ by the formula

$$d_i = (\beta - \alpha) \cdot (\text{number of ones in } x_1, \ldots, x_i) - \alpha \cdot (\text{number of zeroes in } x_1, \ldots, x_i).$$

In other words, $d_i = (\beta - \alpha) r_i - \alpha (i - r_i) = \beta r_i - \alpha i$, where $r_i$ is the corresponding entry in the
rank table.
With a little thought it can be seen that $d_i$ is proportional to the distance from $L_\theta$ of the
point at the end of the $i$th step of the walk. This empowers the distance sequence with the
following critical property:

Lemma 3.4. The substring $x_a, \ldots, x_b$ has density equal to $\theta$ if and only if $d_{a-1} = d_b$. Similarly,
the substring $x_a, \ldots, x_b$ has density at least $\theta$ if and only if $d_{a-1} \le d_b$.

Proof. Although this follows immediately from the geometric argument above, we can also
prove it directly. Using the formula $d_i = \beta r_i - \alpha i$, we find that $d_{a-1} = d_b$ if and only if
$\beta (r_b - r_{a-1}) = \alpha (b - a + 1)$, or equivalently

$$\text{density of } x_a, \ldots, x_b = \frac{r_b - r_{a-1}}{b - a + 1} = \frac{\alpha}{\beta} = \theta.$$

The argument for density $\ge \theta$ is similar. □
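To make Definition 3.3 and Lemma 3.4 concrete, here is a small Python sketch that builds the distance sequence and checks both equivalences exhaustively on a given bitstream. The helper names `distance_sequence` and `check_lemma_3_4` are invented for this illustration, and exact rational arithmetic is used so that the comparison with $\theta$ is not affected by rounding.

```python
from fractions import Fraction


def distance_sequence(bits, alpha, beta):
    """Return d[0..n]: each one adds (beta - alpha), each zero subtracts alpha."""
    d = [0]
    for x in bits:
        d.append(d[-1] + ((beta - alpha) if x == 1 else -alpha))
    return d


def check_lemma_3_4(bits, alpha, beta):
    """Exhaustively compare substring densities against the distance sequence."""
    n = len(bits)
    d = distance_sequence(bits, alpha, beta)
    theta = Fraction(alpha, beta)
    for a in range(1, n + 1):
        for b in range(a, n + 1):
            density = Fraction(sum(bits[a - 1:b]), b - a + 1)
            if (density == theta) != (d[a - 1] == d[b]):
                return False
            if (density >= theta) != (d[a - 1] <= d[b]):
                return False
    return True


bits = [int(ch) for ch in "010110101100"]
print(distance_sequence(bits, 3, 5))
print(check_lemma_3_4(bits, 3, 5))
```

For $\theta = 3/5$ the distance 0 occurs at positions 0, 5 and 10, which corresponds to the substrings of density 0.6 that start at the first bit.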



3.2 The DistMap Algorithm 

With Lemma 3.4 we now have a simple solution to the fixed density problem. We compute the
distance sequence $d_0, \ldots, d_n$ as we pass through our bitstream, keeping track of which distances
we have seen before and when we first saw them. Whenever we find that a distance has been
seen before, we have a substring of density $\theta$ and therefore a potential solution.

We keep track of previously-seen distances using a key $\mapsto$ value map structure with
worst-case $O(\log n)$ search and insertion, such as a red-black tree [5]. Here the key is a distance $D$
that we have seen before, and the value is the position at which we first saw it (i.e., the smallest
$i$ for which $d_i = D$).



procedure DistMap(x_1, ..., x_n, θ = α/β)
  (a, b) ← (0, 0)                            ▷ Best start/end found so far
  δ ← 0                                      ▷ Current distance d_i
  Initialise the empty map m
  Insert m[0] ← 0                            ▷ Record the starting point d_0 = 0
  for i ← 1 to n do
    if x_i = 1 then                          ▷ Compute the new distance d_i
      δ ← δ + (β - α)
    else
      δ ← δ - α
    if m has no key δ then                   ▷ Have we seen this distance before?
      Insert m[δ] ← i                        ▷ No, this is the first time
    else
      if i - m[δ] > b - a + 1 then           ▷ Yes, back at position m[δ]
        (a, b) ← (m[δ] + 1, i)               ▷ Longest substring found so far
  Output (a, b)

Figure 5: The DistMap algorithm for the fixed density problem



The result is the algorithm DistMap, described in Figure 5. Given our choice of map structure,
the following result is clear:

Lemma 3.5. The algorithm DistMap solves the fixed density problem in $O(n \log n)$ time in the
worst case.
We could of course use a hash table instead of a map structure. With a judicious choice
of hash function this could yield $O(n)$ expected time, though the worst case could potentially
be much slower. Because we offer a worst-case $O(n)$ algorithm in Section 4, we do not pursue
hashing any further here.
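For experimentation, DistMap is easy to sketch in Python. Note the substitution: the sketch below uses the built-in `dict`, which is a hash table, so it illustrates the hashing variant just mentioned (expected rather than worst-case bounds) and not the red-black tree of Lemma 3.5. The function name `dist_map` and the use of `None` in place of the initial pair (0, 0) are choices made for this illustration.

```python
def dist_map(bits, alpha, beta):
    """Return (a, b), 1-indexed and inclusive, for a longest substring of
    density alpha/beta, or None if there is none."""
    first_seen = {0: 0}   # distance -> first position at which it occurred
    delta = 0             # current distance d_i
    best = None
    for i, x in enumerate(bits, start=1):
        delta += (beta - alpha) if x == 1 else -alpha
        if delta not in first_seen:
            first_seen[delta] = i            # first time at this distance
        else:
            start = first_seen[delta]        # d_start == d_i, so x_{start+1..i} works
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start + 1, i)
    return best


bits = [int(ch) for ch in "010110101100"]
print(dist_map(bits, 3, 5))
```

Swapping the `dict` for any ordered map with logarithmic search and insertion would recover the worst-case bound stated in Lemma 3.5.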

3.3 The DistSort Algorithm 

We move now to a variant of DistMap that removes any need for a map structure at all. Instead, 
we replace this map with a simple array that we sort in-place after all n bits of the bitstream 
have been processed. The new algorithm is named DistSort, and has the following advantages: 

• Whilst the map structure plays a key role in giving us $O(n \log n)$ running time, it also
comes with a non-trivial memory overhead. If $n$ is large and memory becomes a problem,
the in-place sort used by DistSort may be a more economical choice.

• DistMap relies on searching for precise matches $d_{a-1} = d_b$ within the map structure.
This makes it unsuitable for the bounded density problem, which requires only $d_{a-1} \le d_b$
(Lemma 3.4). If we replace our map with an array sorted by distance $d_i$, then both
problems become easy to solve. Indeed, we find with DistSort that the solutions for the
fixed and bounded density problems differ by just one line.

The key ideas behind DistSort are as follows: 

• We walk through the bitstream and compute each distance $d_i$ as we go, just as we did for
DistMap. However, instead of storing distances in a map, we store each pair $(d_i, i)$ in a
simple array z[0..n], so that each array entry z[i] is the pair $(d_i, i)$.

• Once we have finished our walk through the bitstream, we sort the array z[0..n] by distance.
This gives us a sequence of (distance, position) pairs

$(D_0, P_0), (D_1, P_1), \ldots, (D_n, P_n)$,

where $D_0 \le D_1 \le \ldots \le D_n$ and where each $D_i$ is the distance after the $P_i$th step.

• Finding positions with matching distances is now a simple matter of walking through
the array from left to right, since all of the positions with the same distance will be clumped
together. In each clump we track the smallest and largest positions ($p_{\min}$ and $p_{\max}$), and
these become a candidate substring $x_{p_{\min}+1}, \ldots, x_{p_{\max}}$ with density $\theta$. The longest such
substring is then our solution to the fixed density problem.

• Solving the bounded density problem is just as easy. The only difference is that we
now need our substring $x_{p_{\min}+1}, \ldots, x_{p_{\max}}$ to satisfy $d_{p_{\min}} \le d_{p_{\max}}$, not $d_{p_{\min}} = d_{p_{\max}}$.
To achieve this, we simply change $p_{\min}$ from the smallest position in this clump to the
smallest position in all clumps seen so far.

The full algorithm is given in Figure 6. The fixed and bounded density algorithms differ by
only one line (marked with a comment in bold), where in the bounded case we do not reset $p_{\min}$
upon entering a new clump of pairs with equal distances.

Regarding time complexity, we can choose a worst-case $O(n \log n)$ sorting algorithm, such
as the introsort algorithm of Musser [16]. The subsequent scan through the array runs in linear
time, yielding the following overall result:

Lemma 3.6. The algorithm DistSort solves both the fixed and bounded density problems in
$O(n \log n)$ time in the worst case.
procedure DistSort(x_1, ..., x_n, θ = α/β)
  Initialise an array z[0..n] of (dist, pos) pairs
  δ ← 0                                      ▷ Current distance d_i
  z[0] ← (0, 0)                              ▷ Record the starting point d_0 = 0
  for i ← 1 to n do
    if x_i = 1 then                          ▷ Compute the new distance d_i
      δ ← δ + (β - α)
    else
      δ ← δ - α
    z[i] ← (δ, i)                            ▷ Store the pair (d_i, i) in our array
  Sort z[0..n] by distance, giving a sorted sequence
    of pairs (D_0, P_0), (D_1, P_1), ..., (D_n, P_n)
  (a, b) ← (0, 0)                            ▷ Best start/end positions found so far
  (p_min, p_max) ← (P_0, P_0)                ▷ Potential start/end positions
  i ← 0
  while i ≤ n do
    p_min ← P_i                              ▷ Do this for the fixed density problem ONLY
    p_max ← P_i
    i ← i + 1                                ▷ Run through a clump of pairs with the same distance
    while i ≤ n and D_i = D_{i-1} do
      if P_i < p_min then
        p_min ← P_i                          ▷ A smaller position with this distance
      if P_i > p_max then
        p_max ← P_i                          ▷ A larger position with this distance
      i ← i + 1
    if p_max - p_min > b - a + 1 then
      (a, b) ← (p_min + 1, p_max)            ▷ Longest substring found so far
  Output (a, b)

Figure 6: The DistSort algorithm for the fixed and bounded density problems
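A compact Python sketch of the same idea is given below, with a hypothetical `bounded` flag selecting between the two problems. It relies on Python's built-in `list.sort` rather than introsort. For the bounded problem the sketch keeps a running minimum over every position seen so far, including the first position of each new clump; that detail, the function name `dist_sort`, and the `None` return value are choices of this sketch rather than a transcription of Figure 6.

```python
def dist_sort(bits, alpha, beta, bounded=False):
    """Longest substring (a, b), 1-indexed and inclusive, whose density is
    exactly alpha/beta (bounded=False) or at least alpha/beta (bounded=True).
    Returns None if no substring qualifies."""
    z = [(0, 0)]                       # (distance, position) pairs
    delta = 0
    for i, x in enumerate(bits, start=1):
        delta += (beta - alpha) if x == 1 else -alpha
        z.append((delta, i))
    z.sort()                           # by distance, ties broken by position

    best, best_len = None, 0
    p_min = None                       # smallest position carried across clumps
    i = 0
    while i < len(z):
        j = i
        while j < len(z) and z[j][0] == z[i][0]:
            j += 1                     # z[i:j] is one clump of equal distances
        lo, hi = z[i][1], z[j - 1][1]  # smallest and largest position in the clump
        if bounded and p_min is not None:
            lo = min(lo, p_min)        # any earlier clump has a smaller distance
        p_min = lo
        if hi - lo > best_len:
            best_len, best = hi - lo, (lo + 1, hi)
        i = j
    return best


bits = [int(ch) for ch in "010110101100"]
print(dist_sort(bits, 3, 5))
print(dist_sort(bits, 7, 10, bounded=True))
```

On the running example the bounded call with $\theta = 7/10$ returns positions 4 to 10, the length-seven substring 1101011 quoted in the introduction.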



4 Solving the Fixed Density Problem 

We proceed now to an algorithm for the fixed density problem that runs in $O(n)$ time, even in
the worst case. This algorithm uses DistMap as a starting point, but replaces the generic map
structure with a specialised data structure for the task at hand.

The central observation is the following. As we run the DistMap algorithm, each successive
key in our map is always obtained by adding $+(\beta - \alpha)$ or $-\alpha$ to the previous key. We exploit
this constraint to design a data structure that allows us to "jump" from one key to the next
without requiring a full search, thereby eliminating the $\log n$ factor from our running time.

The data structure is fairly detailed, making it difficult to give a simple overview. The
following outline summarises the broad ideas involved, but for a clearer picture the reader is
referred to the full description in Sections 4.1 and 4.2. The running time of $O(n)$ is established
in Section 4.3 using amortised analysis.

• We begin by arranging the integers into an infinite two-dimensional lattice (Figure 7), so
that $+(\beta - \alpha)$ represents a single step to the right and $-\alpha$ represents a single step down.
This makes moving from one key to the next a local movement within the lattice. This
lattice has infinitely many columns but only $\beta - \alpha$ rows, so a step down from the bottom
row wraps back around to the top (but with a shift).



Figure 7: The two-dimensional lattice of integers for $\beta - \alpha = 3$ and $\alpha = 5$.



• We now use this integer lattice as the "domain" of our map, so that keys (the distances
$d_i$) become points in the lattice, and values (the corresponding positions $i$) are stored at
these points. In this way our data structure becomes a matrix, which is sparse because
only $n$ points in the lattice correspond to "real" keys with non-empty values.

• The next stage in our design is to "compress" this sparse matrix by storing not individual
key $\mapsto$ value pairs but rather horizontal runs of consecutive pairs, as illustrated in Figure 8.
Storing just the start and end of each run allows us to completely reconstruct the missing
keys and values in between.

$-8 \mapsto 7,\ -5 \mapsto 8,\ -2 \mapsto 9,\ 1 \mapsto 10,\ 4 \mapsto 11 \quad\Longrightarrow\quad -8 \mapsto 7,\ \ldots,\ 4 \mapsto 11$

Figure 8: Compressing a horizontal run of consecutive pairs



• We finish by developing a linked structure for storing our matrix. The compressed runs in
each row are stored as a "horizontal" linked list, with additional "vertical" links between
rows for downward steps. We also chain vertical links together, yielding a
balance that offers enough information to support fast movement between keys, but enough
flexibility to support fast insertion of new key $\mapsto$ value pairs.

Before presenting the details, it becomes useful to strengthen our base assumptions as follows. 

Assumption 4.1. Recall from Assumption 2.1 that $\theta = \alpha/\beta$, where $0 \le \alpha \le \beta \le n$. From here
onwards we strengthen this by assuming the stricter bounds $0 < \alpha < \beta \le n$. In other words, we
explicitly disallow the special cases $\theta = 0$ and $\theta = 1$.

Like our earlier assumptions, this is not restrictive in any way. If $\theta = 0$ or $\theta = 1$ then we
simply require the longest continuous substring of zeroes or ones, which is trivial to find in linear
time.



4.1 The Mapping Matrix 

We begin the details with a formal definition of the integer lattice depicted in Figure 7. Recall
from Assumptions 2.1 and 4.1 that both $\beta - \alpha$ and $\alpha$ are strictly positive, and that
$\gcd(\beta - \alpha, \alpha) = 1$.

Definition 4.2 (Lattice Coordinates). Let $z$ be any integer. The lattice coordinates of $z$ are
the unique solutions $(r, c)$ to the equation

$$(\beta - \alpha) c - \alpha r = z, \tag{1}$$

for which $r$ and $c$ are integers and $0 \le r < \beta - \alpha$. We call $r$ and $c$ the row and column of $z$
respectively.

For example, consider Figure 7, in which $\beta - \alpha = 3$ and $\alpha = 5$. The following table lists the
lattice coordinates of several integers $z$:

Integer z                 |   -3   |   0   |   3   |   6   |   1   |  -4
Lattice coordinates of z  | (0,-1) | (0,0) | (0,1) | (0,2) | (1,2) | (2,2)

These are precisely the locations at which each integer can be found in Figure 7, where we
number the rows and columns so that the integer zero appears at coordinates (0, 0).

With a little modular arithmetic it can be shown that every integer appears once and only once
in our lattice, as expressed formally by the following result. The proof is elementary, and we do
not repeat it here.

Lemma 4.3. Lattice coordinates are always well-defined, that is, equation (1) has a unique
solution for every integer $z$. Moreover, every pair of integers $(r, c)$ with $0 \le r < \beta - \alpha$ forms
the lattice coordinates of one and only one integer.
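Equation (1) can be solved mechanically by trying each of the $\beta - \alpha$ candidate rows. The Python sketch below does exactly that; the names `lattice_coords` and `lattice_value` are hypothetical, and the linear scan over rows is chosen for clarity rather than speed (a modular inverse would avoid it).

```python
def lattice_coords(z, alpha, beta):
    """Return the (row, column) of integer z, i.e. the solution of
    (beta - alpha) * c - alpha * r == z with 0 <= r < beta - alpha."""
    rows = beta - alpha
    for r in range(rows):
        if (z + alpha * r) % rows == 0:
            return (r, (z + alpha * r) // rows)
    raise ValueError("no solution: gcd(beta - alpha, alpha) must be 1")


def lattice_value(r, c, alpha, beta):
    """Inverse map: the integer stored at row r, column c of the lattice."""
    return (beta - alpha) * c - alpha * r


# The setting of Figure 7: beta - alpha = 3 and alpha = 5, so beta = 8.
for z in (-3, 0, 3, 6, 1, -4):
    print(z, lattice_coords(z, 5, 8))
```

Running this for $\beta - \alpha = 3$ and $\alpha = 5$ reproduces the coordinates listed in the table above, and composing the two functions in either order returns the starting value, which is the content of Lemma 4.3.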

It is worth reiterating a key feature of this construction, which is that each bit of the bitstream 
gives rise to a local movement within the lattice: 

Lemma 4.4. Consider some position $i$ within the bitstream, where $0 \le i < n$. Suppose that the
lattice coordinates of the distance $d_i$ are $(r, c)$. Then:

• If the $(i + 1)$th bit is a one, the lattice coordinates of the subsequent distance $d_{i+1}$ are
$(r, c + 1)$. That is, we take one step to the right.

• If the $(i + 1)$th bit is a zero and $r < \beta - \alpha - 1$, then the lattice coordinates of $d_{i+1}$ are
$(r + 1, c)$. That is, we take one step down.

• If the $(i + 1)$th bit is a zero and $r = \beta - \alpha - 1$ (i.e., we are on the bottom row of the lattice),
then the lattice coordinates of $d_{i+1}$ are $(0, c - \alpha)$. That is, we wrap back around to the top
with a shift of $\alpha$ columns to the left.

This is a straightforward consequence of Definitions 3.3 and 4.2, and again we omit the proof.
The various movements described in this result are indicated by the solid lines in Figure 7.
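The three cases of Lemma 4.4 translate directly into a single-step function. The following Python sketch (with the invented names `lattice_step` and `walk`) applies it along a bitstream and records the integer found at each visited cell, so that the result can be compared against the distance sequence of Definition 3.3.

```python
def lattice_step(r, c, bit, alpha, beta):
    """Move from cell (r, c) according to the next bit, as in Lemma 4.4."""
    rows = beta - alpha
    if bit == 1:
        return (r, c + 1)          # step right: adds (beta - alpha)
    if r < rows - 1:
        return (r + 1, c)          # step down: subtracts alpha
    return (0, c - alpha)          # bottom row: wrap to the top, shifted left


def walk(bits, alpha, beta):
    """Return the integers (beta - alpha) * c - alpha * r visited by the walk."""
    r, c = 0, 0
    values = [0]
    for bit in bits:
        r, c = lattice_step(r, c, bit, alpha, beta)
        values.append((beta - alpha) * c - alpha * r)
    return values


bits = [int(ch) for ch in "010110101100"]
print(walk(bits, 3, 5))
```

For the running example with $\theta = 3/5$ the lattice has two rows, so every second zero triggers the wrap-around case.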

Recall that our overall strategy is to build a replacement data structure for the generic
key $\mapsto$ value map, whose keys are distances $d_i$ and whose values are the corresponding positions
$i$ in the bitstream. Using Lemma 4.3 we can replace each distance $d_i$ with its lattice coordinates
$(r, c)$, thereby replacing the old mapping $d_i \mapsto i$ with the new mapping $(r, c) \mapsto i$. This effectively
gives us a matrix with $\beta - \alpha$ rows and infinitely many columns, which we formalise as follows.

Definition 4.5 (Mapping Matrix). We define the mapping matrix to be an infinite matrix with
precisely $\beta - \alpha$ rows (numbered $0, \ldots, \beta - \alpha - 1$) and infinitely many columns in both directions
(numbered $\ldots, -1, 0, 1, \ldots$). Each cell of this matrix may contain an integer, or may contain a
special symbol representing an empty cell. The entry in row $r$ and column $c$ of the mapping matrix
$M$ is denoted $M[r, c]$.

Our algorithm now runs as follows. As we process each bit of the bitstream, we walk through
the cells of the mapping matrix as described by Lemma 4.4. If we step into an empty cell, we
store the current position in the bitstream. If we step into a previously-occupied cell then we
have found a substring of density $\theta$.

The full pseudocode is given in Figure 9, under the algorithm name DistMatrix. The algorithm
is of course remarkably similar to DistMap (Figure 5), since the key difference is in the underlying
data structure. Our focus in Section 4.2 is now to fully describe this data structure, and thereby
describe the critical tasks of evaluating and setting the matrix entry $M[r, c]$.
procedure DistMatrix(x_1, ..., x_n, θ = α/β)
  (a, b) ← (0, 0)                            ▷ Best start/end found so far
  (r, c) ← (0, 0)                            ▷ Current location in the matrix
  Initialise the empty mapping matrix M
  Insert M[0, 0] ← 0                         ▷ Record the starting point d_0 = 0
  for i ← 1 to n do
    if x_i = 1 then
      c ← c + 1                              ▷ Step right
    else if r < β - α - 1 then
      r ← r + 1                              ▷ Step down
    else
      (r, c) ← (0, c - α)                    ▷ Step down and wrap around
    if M[r, c] is empty then                 ▷ Have we been here before?
      Insert M[r, c] ← i                     ▷ No, this is the first time
    else
      if i - M[r, c] > b - a + 1 then        ▷ Yes, back at position M[r, c]
        (a, b) ← (M[r, c] + 1, i)            ▷ Longest substring found so far
  Output (a, b)

Figure 9: The DistMatrix algorithm for the fixed density problem
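The control flow of Figure 9 can be exercised in Python before the real data structure is in place. In the sketch below the mapping matrix is stood in for by a `dict` keyed on `(r, c)`; this is a placeholder that gives expected rather than worst-case constant-time access, and it is not the linked structure developed in Section 4.2. The function name `dist_matrix` and the `None` return value are likewise assumptions of the sketch, which requires $0 < \alpha < \beta$ as in Assumption 4.1.

```python
def dist_matrix(bits, alpha, beta):
    """Longest substring (a, b), 1-indexed and inclusive, of density
    alpha/beta, found by walking the mapping matrix. Returns None if
    there is no such substring. Assumes 0 < alpha < beta."""
    rows = beta - alpha
    r, c = 0, 0
    M = {(0, 0): 0}                    # placeholder for the mapping matrix
    best = None
    for i, x in enumerate(bits, start=1):
        if x == 1:
            c += 1                     # step right
        elif r < rows - 1:
            r += 1                     # step down
        else:
            r, c = 0, c - alpha        # step down and wrap around
        if (r, c) not in M:
            M[(r, c)] = i              # first visit: remember the position
        else:
            start = M[(r, c)]          # revisit: x_{start+1..i} has density theta
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start + 1, i)
    return best


bits = [int(ch) for ch in "010110101100"]
print(dist_matrix(bits, 3, 5))
```

Because lattice coordinates are in one-to-one correspondence with distances (Lemma 4.3), this sketch visits cells in exactly the pattern in which DistMap visits keys.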



4.2 The Data Structure 

We cannot afford to store the mapping matrix as a two-dimensional array, because (even
ignoring the infinitely many columns) there are $O(n^2)$ potential cells that a bitstream of length
$n$ might reach.² However, only $n + 1$ cells are visited (and hence non-empty) for any particular
input bitstream. That is, the mapping matrix is sparse.

We therefore aim for a linked structure, where only the cells we visit are stored in memory,
and where these cells include pointers to nearby cells to assist with navigation around the matrix.

However, before describing this linked structure we introduce a form of compression, where
we only need to store the cells involved in downward steps. As we will see in Section 4.3, this
compression is critical for stepping through the matrix in amortised constant time.

Our compression relies on the observation that a run of $k$ consecutive steps to the right
produces a sequence of $k$ consecutive values in adjacent cells of a single row of the matrix.



We can describe such a sequence by storing only the start and end points, without having to 
store each individual cell in between. 
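As a rough illustration of this idea, the Python sketch below collapses one matrix row into runs of consecutive values in adjacent columns and then expands the runs again. It is deliberately simpler than the structure developed in this section (it stores both endpoints of every run rather than only the cells involved in downward steps), and the names `compress_runs` and `expand_runs` are invented for the example. The sample row uses the values shown in Figure 10(b).

```python
def compress_runs(row):
    """row maps column -> value. Return a list of runs
    (first_col, first_val, last_col, last_val) in which both the columns
    and the values increase by one at each step."""
    runs = []
    for col in sorted(row):
        val = row[col]
        if runs and runs[-1][2] == col - 1 and runs[-1][3] == val - 1:
            first_col, first_val, _, _ = runs[-1]
            runs[-1] = (first_col, first_val, col, val)   # extend the current run
        else:
            runs.append((col, val, col, val))             # start a new run
    return runs


def expand_runs(runs):
    """Rebuild the full column -> value mapping from its runs."""
    row = {}
    for first_col, first_val, last_col, _ in runs:
        for k in range(last_col - first_col + 1):
            row[first_col + k] = first_val + k
    return row


values = [70, 71, 72, 30, 31, 32, 10, 11, 12, 79, 80, 81, 50, 83, 84]
row = dict(zip(range(-5, 10), values))
runs = compress_runs(row)
print(runs)
print(expand_runs(runs) == row)
```

Here fifteen cells collapse into six runs, and expanding the runs recovers every value in between.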

This pattern becomes more complicated when new paths through the matrix cross over old
paths, but the core idea remains the same: we look for horizontal runs of consecutive values in
the matrix, and record only where they start and end. Figure 10 gives an example, where four
different paths from four different sections of the bitstream cross through the same row of the
matrix.



2 This of course depends upon the value of $\theta$. If $\theta = 1/2$ for instance, then there are only $2n + 1$ potential cells
and a more direct linear algorithm becomes possible. Here we treat the general case < a < < n.

(a) Several paths that cross through a single matrix row (row 8, columns -5 to 9)

    Path   Enters row at   Position   Exits row at   Position
    A      (8, 1)          10         (8, 3)         12
    B      (8, -2)         30         (8, 1)         33
    C      (8, 7)          50         (8, 7)         50
    D      (8, -5)         70         (8, 9)         84

(b) The corresponding values in the mapping matrix (row 8)

    Column:  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9
    Value:   70  71  72  30  31  32  10  11  12  79  80  81  50  83  84

(c) Storing these values in memory

    Cell       Value in this cell   Value to start this run
    (8, -5)    70                   70
    (8, -2)    30                   30
    (8, 1)     10                   10
    (8, 3)     12                   78
    (8, 7)     50                   82
    (8, 9)     84                   ∅

Figure 10: Compressing a row of the mapping matrix

• Figure 10(a) shows the four paths, which are labelled A, B, C and D in chronological
order as they appear in the bitstream. For instance, path A enters the row at cell (8, 1)
and position 10 in the bitstream, takes two steps to the right, and exits the row from cell
(8, 3) at position 12 in the bitstream. Note that path B subsequently exits from the same
cell that A entered, and that path C includes no rightward steps at all.



• Figure 10(b) shows the state of the mapping matrix after all four paths have been followed.
Note that values from older paths take precedence over values from newer paths, since we 
always record the first position at which we enter each cell. Vertical arrows are included 
as reminders of the cells at which paths enter and exit the row. 



• Figure 10(c) shows how this state can be "compressed" in memory. We only store cells 
at which paths enter and exit the row, and for each such cell (r, c) we record the following 
information: 

- The value stored directly in that cell, i.e., M[r, c];

- The value that "begins" the horizontal run to the right, i.e., M[r, c + 1] − 1.

If the cell (r, c) is itself part of the run (such as (8, −5), (8, −2) and (8, 1) in our example)
then both values will be equal. If the cell (r, c) is the exit for an older path (such as (8, 3)
or (8, 7) in our example) then these values will be different. If there is no run to the right
(as with (8, 9) in our example) then we store the symbol ∅.
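
To see that the compressed records lose no information, the following Python sketch recovers every value in row 8 of Figure 10(b) from the six records of Figure 10(c). It is an illustration only: the function name lookup and the tuple layout are our own, None stands in for the empty symbol, and a binary search is used for convenience, whereas the algorithm itself only ever inspects the record at its current pointer.

```python
from bisect import bisect_right

# Records for row 8 of Figure 10(c):
# (column, value in this cell, value to start the run to the right).
ROW_8 = [(-5, 70, 70), (-2, 30, 30), (1, 10, 10), (3, 12, 78), (7, 50, 82), (9, 84, None)]


def lookup(records, c):
    """Return M[r, c] for a row stored as column-sorted (column, value, run_start) records."""
    columns = [col for col, _, _ in records]
    k = bisect_right(columns, c) - 1          # nearest stored cell at or to the left of c
    if k < 0:
        return None                           # left of every stored cell: empty
    col, value, run_start = records[k]
    if col == c:
        return value                          # the cell itself is stored
    if run_start is None:
        return None                           # no run to the right: empty
    return run_start + (c - col)              # offset inside the run that starts here


print([lookup(ROW_8, c) for c in range(-5, 10)])
# [70, 71, 72, 30, 31, 32, 10, 11, 12, 79, 80, 81, 50, 83, 84]
```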

We collate this information into a full linked data structure as described below in Data
Structure 4.6. A detailed example of this linked structure is illustrated in Figure 11.
Figure 11: An illustration of the full linked data structure (rows r, r + 1 and r + 2, with columns
running from $-\infty$ to $+\infty$; the legend distinguishes entry/exit cells, horizontal links, vertical
links and secondary links)



Data Structure 4.6 (Mapping Matrix). Suppose we have processed the first k bits of our 
bitstream. To store the current state of the mapping matrix, we keep records in memory for the 
following cells: 

• The entry and exit cells in each row, i.e., cells that correspond to positions immediately 
before or after a zero bit; 

• The two cells corresponding to the beginning of the bitstream and our current position;

• "Sentinel" cells $(r, -\infty)$ and $(r, +\infty)$ in each row.

The record for each such cell (r, c) contains the following information:

• The column c;

• The values M[r, c] and M[r, c + 1] − 1 as described above, where for the sentinels $(r, \pm\infty)$
these values are ∅;

• Links to the previous and next cells in the same row (called horizontal links).

In addition, if we have previously stepped down from this cell then we also store:

• A link to the endpoint of this step in the following row (called a vertical link), where this
endpoint is (r + 1, c) or $(0, c - \alpha)$ according to whether or not $r < \beta - \alpha - 1$;

• A link that jumps to the next vertical link in this row, that is, a link to the nearest cell to
the right that also stores a vertical link (we call this new link a secondary link).

We also insert vertical links between the sentinels at $(r, \pm\infty)$, running from each row to the
next, and join these into the chains of secondary links for each row.

To summarise: (i) the "interesting" cells in each row are stored in a horizontal doubly-linked 
list, (ii) we add vertical links corresponding to previous steps down, and (iii) we chain together 
the vertical links from each row into a secondary linked list. 
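
One possible in-memory layout for these records is sketched below in Python. The class name Cell, its field names and the helpers new_row and insert_after are illustrative assumptions rather than part of the paper; None stands in for ∅, and the sentinels use infinite columns.

```python
import math
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(eq=False)
class Cell:
    column: float                                                 # c, or -inf / +inf for a sentinel
    value: Optional[int] = None                                   # M[r, c]
    run_start: Optional[int] = None                               # M[r, c + 1] - 1, or None if no run
    prev: Optional["Cell"] = field(default=None, repr=False)      # horizontal link to the left
    next: Optional["Cell"] = field(default=None, repr=False)      # horizontal link to the right
    down: Optional["Cell"] = field(default=None, repr=False)      # vertical link, if we stepped down here
    secondary: Optional["Cell"] = field(default=None, repr=False) # next cell to the right with a vertical link


def new_row() -> Tuple[Cell, Cell]:
    """Create an empty row consisting only of its two sentinels."""
    left, right = Cell(-math.inf), Cell(math.inf)
    left.next, right.prev = right, left
    return left, right


def insert_after(node: Cell, cell: Cell) -> None:
    """Splice `cell` into the horizontal doubly-linked list directly after `node`."""
    cell.prev, cell.next = node, node.next
    node.next.prev = cell
    node.next = cell


left, right = new_row()
insert_after(left, Cell(0, value=0))      # the starting point (0, 0) of the bitstream
```

Vertical and secondary links are left unset here; they are only created when a step down is taken.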

We return now to fill in the missing parts of the DistMatrix algorithm (Figure 9), namely the
evaluation and setting of the matrix entry M[r, c]. This can be done as follows.

(i) At all times we keep a pointer to the current cell in the matrix (which, according to Data
Structure 4.6, always has a record explicitly stored).






(ii) Each time we step right or down, we adjust the data structure to reflect the new bit that 
has been processed, and we move our pointer to reflect the new current cell. 

(iii) Evaluating and setting M[r, c] then becomes a simple matter of dereferencing our pointer. 

The only step that might not run in constant time is (ii), where we adjust the data structure
and move our pointer. The precise work involved varies according to which type of step we take.

• Step right (processing a one bit): This is a local operation involving no vertical or secondary 
links. We might need to extend the endpoint of the current horizontal run or start a new 
run from the current cell, but these are all simple constant time adjustments involving 
only the immediate left and right horizontal neighbours. 

• Step down (processing a zero bit): This is a more complex operation that uses all three
link types. Suppose that we begin the step in cell (r, c); for convenience we assume that
we step down to (r + 1, c), but the wraparound case $r = \beta - \alpha - 1$ is much the same. If
there is already a vertical link (r, c) → (r + 1, c) then we simply follow it. Otherwise we
do the following:

(1) Find where the destination cell (r + 1, c) should be inserted in the horizontal list for
row r + 1 (or find the cell itself if it is already explicitly stored). We do this by:

- walking back along row r until we find the nearest vertical link to the left, which
we denote $L_-$;

- following the secondary link from $L_-$ to the nearest vertical link to the right,
which we denote $L_+$;

- following the link $L_+$ down to row r + 1;

- walking back along row r + 1 until we find our insertion point.



This series of movements is illustrated in Figure 12(b). Note that our sentinels at
$(r, \pm\infty)$ ensure that the vertical links $L_-$ and $L_+$ will always exist.

Figure 12: Stepping down from (r, c) to (r + 1, c). Panels: (a) The neighbourhood of the source
cell (r, c); (b) The path from (r, c) to (r + 1, c); (c) The new vertical and secondary links

(2) If required, insert the cell (r + 1, c) into the horizontal list for row r+1 and update 
its immediate horizontal neighbours. 
(3) Insert the new vertical link (r, c) → (r + 1, c), which we denote $L_0$.

(4) Replace the secondary link $L_- \to L_+$ with two secondary links $L_- \to L_0 \to L_+$,
as illustrated in Figure 12(c).



Operations (2), (3) and (4) are all constant time operations, but operation (1) may involve
a lengthy walk through the data structure. The reason for the convoluted path (and indeed the
secondary links) is that by walking backwards along each row we can ensure that operation (1)
runs in amortised constant time, as shown in the following section.
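
Operations (1) to (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class Node and the helpers make_row and link_down are hypothetical names used only to build a small test structure, the stored values M[r, c] are omitted, and only the non-wraparound case is handled.

```python
import math


class Node:
    """A stored cell, reduced to the links needed for the walk."""
    def __init__(self, col):
        self.col = col
        self.prev = self.next = None      # horizontal links
        self.down = None                  # vertical link
        self.secondary = None             # next node to the right with a vertical link


def make_row(cols):
    """Build a horizontal list with sentinels; return its nodes keyed by column."""
    nodes = [Node(c) for c in [-math.inf, *cols, math.inf]]
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left
    return {n.col: n for n in nodes}


def link_down(upper, lower, cols):
    """Add vertical links at the given columns and chain them with secondary links."""
    cols = sorted(cols)
    for c in cols:
        upper[c].down = lower[c]
    for c1, c2 in zip(cols, cols[1:]):
        upper[c1].secondary = upper[c2]


def step_down(src):
    """Step from src = (r, c) to (r + 1, c); return (destination, links followed)."""
    if src.down is not None:
        return src.down, 1                # an existing vertical link: just follow it
    l_minus, links = src.prev, 1          # (1) walk back along row r to L-
    while l_minus.down is None:
        l_minus, links = l_minus.prev, links + 1
    l_plus = l_minus.secondary            #     secondary link from L- to L+
    node = l_plus.down                    #     vertical link L+ down to row r + 1
    links += 2
    while node.col > src.col:             #     walk back along row r + 1
        node, links = node.prev, links + 1
    if node.col == src.col:
        dest = node                       # (r + 1, c) is already stored
    else:
        dest = Node(src.col)              # (2) insert (r + 1, c) after node
        dest.prev, dest.next = node, node.next
        node.next.prev = dest
        node.next = dest
    src.down = dest                       # (3) new vertical link L0
    l_minus.secondary, src.secondary = src, l_plus   # (4) L- -> L0 -> L+
    return dest, links


upper = make_row([1, 4, 6])
lower = make_row([1, 3, 6])
link_down(upper, lower, [-math.inf, 1, 6, math.inf])
dest, links = step_down(upper[4])
print(dest.col, links)   # 4 4
```

In this small example the walk goes left to the vertical link at column 1, jumps along its secondary link to column 6, drops to the lower row, and walks back one cell to find the insertion point between columns 3 and 6.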

4.3 Analysis of Running Time 

Through the discussions of the previous section, we find that, with the single exception of the
walk from (r, c) to (r + 1, c) when we step down in the mapping matrix, each bit of the bitstream
can be processed in constant time. The following lemma shows that these exceptional walks can
be processed in amortised constant time, giving DistMatrix an overall running time of $O(n)$.

As in the previous section, we assume that we step down from (r, c) to (r + 1, c); the arguments
for the wraparound case $r = \beta - \alpha - 1$ are essentially the same. It is also important to remember
that the phrases step down and step right refer to the full movement when processing some bit of
the bitstream, and not the many different links that we might follow through the data structure
in performing such a step.

Lemma 4.7. Consider the walk from cell (r, c) to (r + 1, c) in the "step down" phase of the
DistMatrix algorithm, as illustrated in Figure 12(b), and define the length of this walk to be the
total number of links that we follow. After processing the entire bitstream, the sum of the lengths
of all "step down" walks is $O(n)$. In other words, each such walk can be followed in amortised
constant time.

Proof. We prove this result using aggregate analysis, by "counting" the number of links in each 
walk using a rough upper bound. The following links are excluded from this count: 

• all vertical and secondary links; 

• the leftmost horizontal link on each row of each walk; 

• any horizontal links that end at the starting point (0, 0); 

• any horizontal links that end at the current cell (r, c). 

Figure 13 shows a sample walk where the excluded links are marked with dotted arrows, and the
remaining links (all horizontal) are marked with bold solid arrows. It is clear that we exclude
$O(n)$ links in total (see footnote 3), and so if we can show that at most $O(n)$ horizontal links remain then the
proof is complete.

Figure 13: Excluded links in a "step down" walk

Within each walk from (r, c) to (r + 1, c), the horizontal links that remain have the following
critical properties:

• The endpoint of each link in row r is also the endpoint of some earlier step down. Moreover,
this earlier step down was followed immediately by a succession of steps right that reached
at least as far along the row as (r, c).

• The endpoint of each link in row r + 1 is also the beginning of some earlier step down.
Moreover, this earlier step down was preceded immediately by a succession of steps right
that originated at least as far back along the row as (r + 1, c).

3 A horizontal link ending at (0, 0) can occur at most twice per walk (and at most once if $\beta - \alpha > 1$). A
horizontal link ending at (r, c) can occur at most once per walk, and only in the special case $\beta - \alpha = 1$.

Figure 14: Earlier successions of steps associated with the remaining links

These properties are a consequence of our compression (recall that each non-sentinel cell
that we store is either (0, 0), the current cell, a row entry or a row exit), as well as the fact
that there are no vertical links between $L_-$ and $L_+$ that join row r with row r + 1. Figure 14
illustrates the successions of rightward steps that are described above.

We can now associate each remaining link $\ell$ with a position $\pi(\ell)$ in the bitstream:

• If the link $\ell$ is on the "upper" row r, consider the oldest sequence of steps that stepped
down to the endpoint of $\ell$ and then right all the way across to (r, c), as illustrated in
Figure 15(a). We define $\pi(\ell)$ to be the position in the bitstream that was reached by this
sequence when it passed through the cell (r, c). Note that $0 < \pi(\ell) \le n$.

• If the link $\ell$ is on the "lower" row r + 1, consider the oldest sequence of steps that stepped
right from (r + 1, c) all the way across to the endpoint of $\ell$ and then down, as illustrated
in Figure 15(b). We define $\pi(\ell)$ to be the position in the bitstream that was reached by
this sequence when it passed through the cell (r + 1, c), negated so that $-n \le \pi(\ell) < 0$.



Figure 15: The earlier sequence of steps that defines $\pi(\ell)$. Panels: (a) If $\ell$ is on the upper row;
(b) If $\ell$ is on the lower row



The key to achieving an $O(n)$ total of walk lengths is to observe that the function $\pi$ is
one-to-one:

• A link $\ell_1$ on the upper row of some walk can never have the same value of $\pi$ as a link $\ell_2$
on the lower row of some (possibly different) walk, since $\pi(\ell_2) < 0 < \pi(\ell_1)$.

• Within a single walk:

- The values $\pi(\ell)$ for links $\ell$ on the upper row r are distinct, because each corresponds
to a different historical path through (r, c), with a different initial entry point into
row r.

- Likewise, the values $\pi(\ell)$ for links $\ell$ on the lower row r + 1 are distinct, because each
corresponds to a different historical path along row r + 1 with a different final exit
point from row r + 1.

• Between different walks:

- Because we insert a new vertical link after every walk, each walk must have a distinct
starting point (r, c). The values $\pi(\ell)$ from the upper rows of different walks are
therefore distinct because they correspond to positions in the bitstream for distinct
cells (r, c).

- Likewise, the values $\pi(\ell)$ from the lower rows of different walks are distinct because
they correspond to positions in the bitstream for distinct cells (r + 1, c).

Therefore $\pi$ is a one-to-one function. Because $\pi(\ell) \in \{-n, -n + 1, \ldots, n - 1, n\}$, it follows
that the number of links $\ell$ in the domain of the function can be at most $2n + 1$. Hence there
are $O(n)$ horizontal links remaining that we have not excluded from our count, and the proof is
complete. □

Through Lemma 4.7 we now find that each bit of the bitstream can be completely processed
in amortised constant time, yielding the following final result:

Corollary 4.8. The algorithm DistMatrix solves the fixed density problem in O(n) time in the 
worst case. 



5 Solving the Bounded Density Problem 

We finish our suite of algorithms with a linear time solution to the bounded density problem,
improving upon the log-linear DistSort algorithm of Section 3. Unlike our linear time solution
to the fixed density problem, this algorithm is simple to express, uses no sophisticated data
structures, and essentially involves just a handful of linear scans.

Once again we base our new algorithm on the distance sequence $d_0, \ldots, d_n$. Recall from
Lemma 3.4 that we seek the longest substring $x_a, \ldots, x_b$ in the bitstream for which $d_{a-1} \le d_b$.
We begin with the following simple observation:

Lemma 5.1. Suppose that $x_a, \ldots, x_b$ is the longest substring of density $\ge \theta$ in our bitstream.
Then there is no $i < a - 1$ for which $d_i \le d_{a-1}$, and there is no $i > b$ for which $d_i \ge d_b$.

The proof is simple: if there were such an i, then we could extend our substring to position
i and obtain a longer substring with density $\ge \theta$. This result motivates the following definition:

Definition 5.2 (Minimal and Maximal Position). Let k be a position in the bitstream, i.e.,
some integer in the range $0 \le k \le n$. We call k a minimal position if there is no $i < k$ for which
$d_i \le d_k$, and we call k a maximal position if there is no $i > k$ for which $d_i \ge d_k$.

Figure 16 plots the distance sequence for the bitstream 1001101001011 with target density
$\theta = \alpha/\beta = 3/5$, and marks the minimal and maximal positions on this plot.
Minimal and maximal positions have the following important properties:

• They are the only positions that we need to consider. That is, the solution to the bounded
density problem must be a substring $x_a, \ldots, x_b$ for which $a - 1$ is a minimal position and
$b$ is a maximal position (Lemma 5.1).

• They are simple to compute in $O(n)$ time. To find all minimal positions, we simply walk
through the distance sequence $d_0, \ldots, d_n$ and collect positions i for which $d_i$ is smaller than
any distance seen before. To find all maximal positions, we walk through the distance
sequence in reverse ($d_n, \ldots, d_0$) and collect positions i for which $d_i$ is larger than any
distance seen before.

• They are ordered by distance. That is, if the minimal positions are $a_1, a_2, \ldots, a_p$ from
left to right ($a_1 < a_2 < \ldots < a_p$) then we have $d_{a_1} > d_{a_2} > \ldots > d_{a_p}$. Likewise, if the
maximal positions are $b_1, b_2, \ldots, b_q$ from left to right ($b_1 < b_2 < \ldots < b_q$) then we have
$d_{b_1} > d_{b_2} > \ldots > d_{b_q}$. This is an immediate consequence of Definition 5.2.

Figure 16: Minimal and maximal positions for the bitstream 1001101001011 (distance $d_i$ plotted
against position $i$)
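
The two linear scans can be checked on the example bitstream with a few lines of Python. The function names below are our own; the code simply follows Definition 5.2 and the scans described in the list above.

```python
def distance_sequence(bits, alpha, beta):
    """d_0 = 0; add (beta - alpha) for each one bit and subtract alpha for each zero bit."""
    d = [0]
    for bit in bits:
        d.append(d[-1] + (beta - alpha if bit == 1 else -alpha))
    return d


def minimal_positions(d):
    """Positions whose distance is strictly smaller than every earlier distance."""
    result = [0]
    for i in range(1, len(d)):
        if d[i] < d[result[-1]]:
            result.append(i)
    return result


def maximal_positions(d):
    """Positions whose distance is strictly larger than every later distance, left to right."""
    result = [len(d) - 1]
    for i in range(len(d) - 2, -1, -1):
        if d[i] > d[result[-1]]:
            result.append(i)
    result.reverse()
    return result


d = distance_sequence([int(ch) for ch in "1001101001011"], 3, 5)
print(d)                      # [0, 2, -1, -4, -2, 0, -3, -1, -4, -7, -5, -8, -6, -4]
print(minimal_positions(d))   # [0, 2, 3, 9, 11]
print(maximal_positions(d))   # [1, 5, 7, 13]
```

The distances at the minimal positions (0, −1, −4, −7, −8) and at the maximal positions (2, 0, −1, −4) are both strictly decreasing, as the third property states.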

Our algorithm then runs as follows:

1. We compute the distance sequence in $O(n)$ time, by incrementally adding $(\beta - \alpha)$ or $-\alpha$
as seen in DistMap and DistSort.

2. We compute the minimal positions $a_1, a_2, \ldots, a_p$ and the maximal positions $b_1, b_2, \ldots, b_q$
in $O(n)$ time as described above.

3. For each minimal position $a_i$, we find the largest maximal position $b_j$ for which $d_{a_i} \le d_{b_j}$.
This gives the substring $x_{a_i + 1}, \ldots, x_{b_j}$ of density $\ge \theta$ and length $b_j - a_i$, and we compare
this with the longest such substring found so far.

The key observation is that, because minimal and maximal positions are ordered by distance,
step 3 can also be performed in $O(n)$ time. Specifically, if the minimal position $a_i$ is matched
with the maximal position $b_j$, then the next minimal position $a_{i+1}$ will be matched with an
equal or later maximal position, i.e., one of $b_j, b_{j+1}, \ldots, b_q$. We can therefore keep a pointer into
the sequence of maximal positions and slowly move it forward as we process each of $a_1, \ldots, a_p$,
giving step 3 an $O(n)$ running time in total.

We name this algorithm PositionSweep; see Figure 17 for the pseudocode. Through the
discussion above we obtain the following final result:

Lemma 5.3. The algorithm PositionSweep solves the bounded density problem in $O(n)$ time in
the worst case.
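
The three steps of the algorithm fit into a short, self-contained Python sketch. The function name position_sweep is our own, and two details are assumptions of the sketch: the maximal positions are stored from left to right (matching the ordering $b_1 < \ldots < b_q$ used in the text), and the best length is tracked explicitly from zero so that answers of length one are also reported. It returns (0, 0) when no substring qualifies.

```python
def position_sweep(bits, alpha, beta):
    """Longest substring with density >= alpha/beta, as 1-based (a, b); (0, 0) if none."""
    n = len(bits)

    # Step 1: the distance sequence d_0, ..., d_n.
    d = [0] * (n + 1)
    for i, bit in enumerate(bits, start=1):
        d[i] = d[i - 1] + (beta - alpha if bit == 1 else -alpha)

    # Step 2: minimal positions (left to right) and maximal positions (right to left).
    minimal = [0]
    for i in range(1, n + 1):
        if d[i] < d[minimal[-1]]:
            minimal.append(i)
    maximal = [n]
    for i in range(n - 1, -1, -1):
        if d[i] > d[maximal[-1]]:
            maximal.append(i)
    maximal.reverse()                     # now ordered left to right

    # Step 3: one forward sweep; the pointer j never moves backwards.
    best, best_len = (0, 0), 0
    j = 0
    for m in minimal:
        while j + 1 < len(maximal) and d[m] <= d[maximal[j + 1]]:
            j += 1
        length = maximal[j] - m           # substring x_{m+1}, ..., x_{maximal[j]}
        if length > best_len:
            best, best_len = (m + 1, maximal[j]), length
    return best


print(position_sweep([int(ch) for ch in "1001101001011"], 3, 5))   # (4, 13)
```

For the example bitstream the result is the substring from position 4 to position 13, which has 6 ones among 10 bits and therefore density exactly 3/5.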

procedure PositionSweep(x_1, ..., x_n, θ = α/β)
    d_0 ← 0                                          ▷ Compute the distance sequence
    for i ← 1 to n do
        if x_i = 1 then
            d_i ← d_{i−1} + (β − α)
        else
            d_i ← d_{i−1} − α
    p ← 1 ; a_1 ← 0                                  ▷ Compute minimal positions
    for i ← 1 to n do
        if d_i < d_{a_p} then
            p ← p + 1 ; a_p ← i
    q ← 1 ; b_1 ← n                                  ▷ Compute maximal positions
    for i ← n − 1 downto 0 do
        if d_i > d_{b_q} then
            q ← q + 1 ; b_q ← i
    Reverse b_1, ..., b_q                            ▷ Renumber so that b_1 < ... < b_q
    (a, b) ← (0, 0)                                  ▷ Best start/end found so far
    j ← 1
    for i ← 1 to p do                                ▷ Run through minimal positions
        while j < q and d_{a_i} ≤ d_{b_{j+1}} do     ▷ Find best maximal position
            j ← j + 1
        if b_j − a_i > b − a + 1 then
            (a, b) ← (a_i + 1, b_j)                  ▷ Longest substring found so far
    Output (a, b)

Figure 17: The PositionSweep algorithm for the bounded density problem

6 Measuring Performance

We finish this paper with a practical field test of the different algorithms for the fixed density
problem (see footnote 4). In particular, because the linear DistMatrix algorithm involves a complex data
structure with potentially significant overhead, it is useful to compare its practical performance
against the log-linear but much simpler algorithms DistMap and DistSort. The tests are designed
as follows:

• We use bitstreams of length $n = 10^8$ for all tests. This value of n was chosen to be large but
manageable. We keep n fixed merely to simplify the data presentation; additional data has
been collected for several smaller values of n, and the results show similar characteristics
to those described here.

• All bitstreams are pseudo-random (see footnote 5). This is of particular benefit to the SkipMisMatch
algorithm, whose expected running time of $O(n \log n)$ in a random scenario is significantly
better than its worst case time of $O(n^2)$.

• We run tests with several different values of the target density $\theta$. This includes values
close to and far away from 1/2, as well as values with small and large denominators; our
aim is to identify to what degree the performance of different algorithms depends upon $\theta$.
The values of $\theta$ that we use are |, |, |, |, ygy, and j^-.

• Each test involves the same 200 pre-generated bitstreams of length $n = 10^8$. For each
algorithm and each value of $\theta$ we measure the mean running time over all 200 bitstreams.
All running times are measured as user + system time, running on a single 3 GHz Intel
Core 2 CPU with 4GB of RAM. All algorithms are coded in C++ under GNU/Linux.

4 We omit the bounded density problem from this field test because the linear algorithm PositionSweep is
simple and slick, with neither the complexity nor the potential overhead of DistMatrix.

5 Bitstreams were generated using the rand() function from the Linux C Library.
Figure 18: Running times of different algorithms for the fixed density problem (comparison of
running times for n = 100,000,000; series: SkipMisMatch, DistMap, DistSort, DistMatrix)



The results are plotted in Figure 18; note that the time axis uses a log scale, with each
horizontal line representing a factor of approximately ×3. Error bars are not included because
most standard errors are within ±1%; the only exceptions are for $\theta = 1/2$, where DistMap has a
standard error of ±1.6% and SkipMisMatch has a standard error of ±10%. The values of $\theta$ are
ordered by distance from 1/2.

Happily, the results are what we hope for. The log-linear algorithms DistMap and DistSort
perform significantly better than SkipMisMatch in most cases, and the linear algorithm
DistMatrix consistently outperforms all of the others.

The dependency of SkipMisMatch upon $\theta$ is evident: performance is best when both $|\theta - 1/2|$
and the denominator $\beta$ are large (as expected from Lemma 2.3), bringing it close to the 4 second
running time of DistMatrix for the extreme case $\theta$ = j^-. At the other extreme, for $\theta = 1/2$ the
SkipMisMatch algorithm runs orders of magnitude slower, with a mean running time of over an
hour and some individual cases taking up to 10^ hours.

Amongst the log-linear algorithms (see footnote 6), we find that DistSort performs noticeably better than
DistMap. Part of the reason is the memory overhead due to the map structure: it was found
that DistMap often exceeded the available memory on the machine, burdening it with a reliance
on virtual memory (which of course is much slower). The linear DistMatrix algorithm also suffers
from memory problems to a lesser extent, but Figure 18 shows that the effectiveness of
the algorithm more than compensates for this. Figure 19 plots the peak memory usage for each
algorithm, again averaged over all 200 bitstreams.

An interesting feature of the running times is that DistMap depends upon $\theta$ in an opposite
manner to SkipMisMatch. This is because when $\theta \approx 1/2$ or the denominator $\beta$ is small, there are
fewer distinct distances amongst $d_0, \ldots, d_n$, and hence fewer elements stored in the map.

6 For DistMap and DistSort, the map and sort are implemented using std::map and std::sort from the C++
Standard Library, as implemented by the GNU C++ compiler version 4.3.2.
Figure 19: Peak memory usage of different algorithms for the fixed density problem (peak memory
usage for n = 100,000,000)



In conclusion, it is pleasing to note how consistently DistMatrix performs across all of the
tested values of $\theta$, with mean running times ranging from 3.2 seconds to 4.2 seconds and standard
errors of just 0.1%. The experiments therefore suggest that the added complexity and
overhead of DistMatrix are well justified by the efficiency of the algorithm and its underlying
data structure.